The objective of the present work is to develop a smart keyboard that helps people be more effective on their mobile devices. A predictive text model has been developed that offers the user of a mobile device three options for what the next word might be.
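To make the idea of three suggestions concrete, here is a minimal sketch of a next-word predictor. It assumes a simple bigram frequency model, which is only an illustration and not necessarily the model built in this work. The function names `train_bigrams` and `suggest` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(documents):
    """Count, for every word, which words follow it across all documents."""
    followers = defaultdict(Counter)
    for doc in documents:
        # Naive tokenisation (lowercase + whitespace split); an assumption,
        # a real model would need more careful cleaning.
        words = doc.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def suggest(followers, previous_word, k=3):
    """Return up to k of the most frequent words seen after previous_word."""
    counts = followers.get(previous_word.lower())
    if not counts:
        return []  # unseen word: no suggestion in this simple sketch
    return [word for word, _ in counts.most_common(k)]


# Toy documents, invented for illustration only.
toy_documents = [
    "i love my phone",
    "i love my dog",
    "i love coffee",
    "i love my keyboard",
    "i need coffee",
]
model = train_bigrams(toy_documents)
print(suggest(model, "love"))  # ['my', 'coffee']
print(suggest(model, "my"))    # three candidates, each seen once
```

With only one word of context, this sketch returns nothing for words it has never seen. A practical keyboard would need a fallback for that case, which is left out here.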
To develop such a model, a large corpus of text documents has been created by merging three different types of English sources: blogs, news and tweets. The raw data, the Capstone Data Set, was provided by Johns Hopkins University, and all the code used to create this report and the proposed model is available on GitHub.
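The merging step can be sketched as follows. This is an illustration under stated assumptions, not the code from the project repository: the helper `build_corpus` is hypothetical, the three sources are represented by small in-memory lists, and each list entry is assumed to be one document.

```python
from collections import Counter


def build_corpus(sources):
    """Merge per-source document lists into one list of (source, text) pairs."""
    corpus = []
    for name, documents in sources.items():
        for text in documents:
            text = text.strip()
            if text:  # skip empty entries
                corpus.append((name, text))
    return corpus


# Toy stand-ins for the three sources (invented for illustration).
sources = {
    "blogs": ["A long blog post about travel.", ""],
    "news": ["Markets closed higher on Friday."],
    "twitter": ["so excited for the weekend!", "coffee first"],
}

corpus = build_corpus(sources)
per_source = Counter(name for name, _ in corpus)
print(len(corpus), dict(per_source))
```

Keeping the source label next to each document makes it possible to report statistics per source after the merge.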
The research question is centered on: “How can an efficient predictive text model be developed on the basis of publicly available data such as blogs, news wires and tweets?” This implies that the methodology developed in this work can be replicated for any language, if needed.
The data

The data is composed of more than 4 million documents, the exact total being 4’269’678. The following tables indicate the statistics for the three main sources: blogs, news and tweets.